Introduction

The document contains a detailed analysis of forecast sale expectations of High-End Vacuums for Vac-Attack company in December 2020 using historical data. This analysis aims to determine if the company will meet the target and ensure the stock is on hand to match the demand.

The report describes data cleaning, summary statistics, visualisation and analysis using the statistical model fitting, implemented with R programming language.

Loading and installing the required R package

if (!require(tidyverse))
  install.packages("tidyverse")

if (!require(lubridate))
  install.packages("lubridate")

if (!require(plotly))
  install.packages("plotly")

if (!require(car))
  install.packages("stargazer")

if (!require(xgboost))
  install.packages("xgboost")

library(tidyverse)
library(lubridate)
library(stargazer)
library(car)
library(xgboost)

Importing Data

First import the Market data.

market_sale_col <- readr::read_csv("Data/MarketingCols.csv", col_names = FALSE)

market_sale_raw <- readr::read_csv("Data/MarketingSales.csv", 
                                   col_names = market_sale_col$X1)

Then import the December data for prediction.

dec_col <- readr::read_csv("Data/DecemberCols.csv", col_names = FALSE)

dec_ad_raw <- readr::read_csv("Data/DecemberAdData.csv", 
                                   col_names = dec_col$X1)

Quick explore of the read-in the data sets

dim(market_sale_raw)
## [1] 1796   11
glimpse(market_sale_raw)
## Rows: 1,796
## Columns: 11
## $ Date                   <chr> "1/01/16", "2/01/16", "3/01/16", "4/01/16", "5/…
## $ PositiveNews           <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,…
## $ NegativeCoverage       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ Competition            <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ AdvertisingSpend       <dbl> 4199.86, 14768.20, 11019.79, 3358.30, 5207.19, …
## $ Month                  <chr> "January", "January", "January", "January", "Ja…
## $ Day                    <chr> "Friday", "Saturday", "Sunday", "Monday", "Tues…
## $ `0508Line_247`         <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ UltraEdition_Available <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ COVID_Lockdown         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ Sales                  <dbl> 66, 84, 78, 70, 73, 67, 85, 78, 80, 87, 68, 83,…
dim(dec_ad_raw)
## [1] 31  4
glimpse(dec_ad_raw)
## Rows: 31
## Columns: 4
## $ Date             <chr> "1/12/20", "2/12/20", "3/12/20", "4/12/20", "5/12/20"…
## $ AdvertisingSpend <dbl> 10568.28, 8218.31, 4907.38, 8057.25, 21481.50, 484.17…
## $ Month            <chr> "December", "December", "December", "December", "Dece…
## $ Day              <chr> "Tuesday", "Wednesday", "Thursday", "Friday", "Saturd…

Tidy up the variables of both datasets

The next step is to tidy up the data sets for visualisation and modelling. I have also included year, month and day variables in both factor and numeric formats.

market_sale_final <- 
  market_sale_raw %>% 
  mutate(Date = dmy(Date),
         Ad_Spend =  AdvertisingSpend,
         Phone_24 = factor(`0508Line_247`),
         Positive_News = factor(PositiveNews),
         Negative_News = factor(NegativeCoverage),
         Competition = factor(Competition),
         Ultra_Edition = factor( UltraEdition_Available), 
         COVID_Lockdown = factor(COVID_Lockdown),
         Year = factor(year(Date)),
         Month = factor(Month, levels = month.name),
         Day = factor(Day, levels =
                        c("Monday", "Tuesday", "Wednesday", "Thursday",
                          "Friday", "Saturday", "Sunday")),
         Month_num = as.numeric(Month),
         Year_num = year(Date),
         Day_num = as.numeric(Day),
         Date_num = date(Date)) %>% 
  select(Sales, Ad_Spend, Date, Phone_24, Positive_News, Negative_News, 
         Competition, Ultra_Edition, COVID_Lockdown,
         Year, Month, Day, Day_num, Date_num, Month_num, Year_num)

dim(market_sale_final)
## [1] 1796   16
glimpse(market_sale_final)
## Rows: 1,796
## Columns: 16
## $ Sales          <dbl> 66, 84, 78, 70, 73, 67, 85, 78, 80, 87, 68, 83, 62, 94,…
## $ Ad_Spend       <dbl> 4199.86, 14768.20, 11019.79, 3358.30, 5207.19, 3962.76,…
## $ Date           <date> 2016-01-01, 2016-01-02, 2016-01-03, 2016-01-04, 2016-0…
## $ Phone_24       <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ Positive_News  <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0…
## $ Negative_News  <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ Competition    <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ Ultra_Edition  <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ COVID_Lockdown <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ Year           <fct> 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2…
## $ Month          <fct> January, January, January, January, January, January, J…
## $ Day            <fct> Friday, Saturday, Sunday, Monday, Tuesday, Wednesday, T…
## $ Day_num        <dbl> 5, 6, 7, 1, 2, 3, 4, 5, 6, 7, 1, 2, 3, 4, 5, 6, 7, 1, 2…
## $ Date_num       <date> 2016-01-01, 2016-01-02, 2016-01-03, 2016-01-04, 2016-0…
## $ Month_num      <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ Year_num       <dbl> 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2…
dec_ad_final <-
  dec_ad_raw %>%
  mutate(
    Date = dmy(Date),
    Ad_Spend =  AdvertisingSpend,
    Phone_24 = factor(0, levels = c(0, 1)),
    Positive_News = factor(0, levels = c(0, 1)),
    Negative_News = factor(0, levels = c(0, 1)),
    Competition  = factor(0, levels = c(0, 1)),
    Ultra_Edition = factor(1, levels = c(0, 1)),
    COVID_Lockdown = factor(0, levels = c(0, 1)),
    Year = factor(2020, levels = market_sale_final$Year |> levels() ),
    Month = factor(Month, levels = month.name),
    Day = factor(Day, levels =
                   c("Monday", "Tuesday", "Wednesday", "Thursday",
                        "Friday", "Saturday", "Sunday")),
    Year_num = 2020,
    Month_num = 12,
    Day_num = as.numeric(Day),
    Date_num = date(Date)
  ) %>% 
  select(Ad_Spend, Date, Phone_24, Positive_News, Negative_News, 
         Competition, Ultra_Edition, 
         COVID_Lockdown, Year, Month, Day, Day_num, Date_num, Month_num, Year_num)

dim(dec_ad_final)
## [1] 31 15
glimpse(dec_ad_final)
## Rows: 31
## Columns: 15
## $ Ad_Spend       <dbl> 10568.28, 8218.31, 4907.38, 8057.25, 21481.50, 484.17, …
## $ Date           <date> 2020-12-01, 2020-12-02, 2020-12-03, 2020-12-04, 2020-1…
## $ Phone_24       <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ Positive_News  <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ Negative_News  <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ Competition    <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ Ultra_Edition  <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ COVID_Lockdown <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ Year           <fct> 2020, 2020, 2020, 2020, 2020, 2020, 2020, 2020, 2020, 2…
## $ Month          <fct> December, December, December, December, December, Decem…
## $ Day            <fct> Tuesday, Wednesday, Thursday, Friday, Saturday, Sunday,…
## $ Day_num        <dbl> 2, 3, 4, 5, 6, 7, 1, 2, 3, 4, 5, 6, 7, 1, 2, 3, 4, 5, 6…
## $ Date_num       <date> 2020-12-01, 2020-12-02, 2020-12-03, 2020-12-04, 2020-1…
## $ Month_num      <dbl> 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12,…
## $ Year_num       <dbl> 2020, 2020, 2020, 2020, 2020, 2020, 2020, 2020, 2020, 2…

Data exploration

Data visulisation

I will first generate some plots to look for any relationship.

The first plot is total unit sold versus time.

ggplot(market_sale_final, aes(x = Date, y = Sales)) +
  geom_path() +
  scale_x_date(date_breaks = "6 month", expand = c(0.01, 0.01), 
                 date_labels = "%b %Y") + 
  theme_light()

There is clearing some seasonality on the total unit sold versus time for adjustment

The second plot is to examine relationship between the total unit sold versus the total advertising spend.

ggplot(market_sale_final, aes(x = Ad_Spend, y = Sales)) +
  geom_point()

There is a weak positive correlation between Advertising Spend and total unit sold, with the pearson correlation of 41.1%.

ggplot(market_sale_final, aes(x = Date)) +
  geom_line(aes(y = Sales), linewidth = 1.5, color = "red", alpha = 0.8) + 
  geom_line(aes(y = Ad_Spend/100), color = "blue", alpha = 0.8) + 
  scale_y_continuous(
    name = "The units sold",
    # Add a second axis and specify its features
    sec.axis = sec_axis(~.*100, name="Total Advertising Spend ($)")
  ) +
  scale_x_date(date_breaks = "6 month", expand = c(0.01, 0.01), 
                 date_labels = "%b %Y") + 
  theme_light() +
  theme(
    axis.title.y = element_text(color = "red", size=13),
    axis.title.y.right = element_text(color = "blue", size=13)
  )

hist(market_sale_final$Sales)

The overall number of Sales is looking reasonably normal, thus there is no need to apply any additional normalization methods.

Descriptive summary

Total units sold versus years.

market_sale_final %>%
  group_by(Year) %>%
  summarise(Sales_mean = mean(Sales),
            Sales_sum = sum(Sales)) %>% 
  knitr::kable()
Year Sales_mean Sales_sum
2016 89.48087 32750
2017 82.02740 29940
2018 76.81370 28037
2019 78.07671 28498
2020 66.04179 22124

The total units sold appears to decline from year to year.

Total units sold versus months

market_sale_final %>%
  group_by(Month) %>%
  summarise(Sales_mean = mean(Sales),
            Sales_sum = sum(Sales))%>% 
  knitr::kable()
Month Sales_mean Sales_sum
January 66.17419 10257
February 65.78169 9341
March 64.52258 10001
April 59.66000 8949
May 65.76129 10193
June 71.71333 10757
July 71.52258 11086
August 80.06452 12410
September 88.28000 13242
October 91.63226 14203
November 107.91333 16187
December 118.73387 14723

The most of total units sold appears to be at the later months of the year.

Total units sold versus days of the week.

market_sale_final %>%
  group_by(Day) %>%
  summarise(Sales_mean = mean(Sales),
            Sales_sum = sum(Sales))%>% 
  knitr::kable()
Day Sales_mean Sales_sum
Monday 78.26848 20115
Tuesday 78.85156 20186
Wednesday 77.59766 19865
Thursday 78.82422 20179
Friday 79.33852 20390
Saturday 78.82101 20257
Sunday 79.21012 20357

The total units sold appears to very similar between the days of each week.

market_sale_final %>%
  group_by(Ultra_Edition) %>%
  summarise(Sales_mean = mean(Sales),
            Sales_sum = sum(Sales))%>% 
  knitr::kable()
Ultra_Edition Sales_mean Sales_sum
0 81.28650 74052
1 76.04181 67297
market_sale_final %>%
  group_by(Year, Ultra_Edition) %>%
  summarise(Sales_mean = mean(Sales),
            Sales_sum = sum(Sales))%>% 
  knitr::kable()
Year Ultra_Edition Sales_mean Sales_sum
2016 0 89.48087 32750
2017 0 82.02740 29940
2018 0 63.12222 11362
2018 1 90.13514 16675
2019 1 78.07671 28498
2020 1 66.04179 22124
market_sale_final %>%
  group_by(Positive_News) %>%
  summarise(Sales_mean = mean(Sales),
            Sales_sum = sum(Sales))%>% 
  knitr::kable()
Positive_News Sales_mean Sales_sum
0 78.22094 135948
1 93.12069 5401
market_sale_final %>%
  group_by(Negative_News) %>%
  summarise(Sales_mean = mean(Sales),
            Sales_sum = sum(Sales)) %>% 
  knitr::kable()
Negative_News Sales_mean Sales_sum
0 78.85618 140364
1 61.56250 985
market_sale_final %>%
  group_by(Positive_News, Negative_News) %>%
  summarise(Sales_mean = mean(Sales),
            Sales_sum = sum(Sales))%>% 
  knitr::kable()
## `summarise()` has grouped output by 'Positive_News'. You can override using the
## `.groups` argument.
Positive_News Negative_News Sales_mean Sales_sum
0 0 78.37573 134963
0 1 61.56250 985
1 0 93.12069 5401
market_sale_final %>%
  group_by(Competition)  %>%
  summarise(Sales_mean = mean(Sales),
            Sales_sum = sum(Sales))%>% 
  knitr::kable()
Competition Sales_mean Sales_sum
0 80.09143 111247
1 73.96069 30102
market_sale_final %>%
  group_by(COVID_Lockdown) %>%
  summarise(Sales_mean = mean(Sales),
            Sales_sum = sum(Sales))%>% 
  knitr::kable()
COVID_Lockdown Sales_mean Sales_sum
0 80.24484 139947
1 26.96154 1402
market_sale_final %>%
  group_by(Phone_24) %>%
  summarise(Sales_mean = mean(Sales),
            Sales_sum = sum(Sales))%>% 
  knitr::kable()
Phone_24 Sales_mean Sales_sum
0 76.10649 95057
1 84.62888 46292

Modelling using linear regression

Fitting the model

linear_reg_fit <-
  lm(
    Sales ~ Ad_Spend + (Year * Month) / Day +
      COVID_Lockdown + Ultra_Edition + Phone_24 + 
      Positive_News + Negative_News + Competition,
    data = market_sale_final
  )
anova(linear_reg_fit) %>% knitr::kable() 
Df Sum Sq Mean Sq F value Pr(>F)
Ad_Spend 1 151426.5271 151426.52709 5321.5665377 0.0000000
Year 4 94607.6547 23651.91367 831.1967181 0.0000000
Month 11 517816.9361 47074.26692 1654.3260182 0.0000000
COVID_Lockdown 1 45137.4566 45137.45660 1586.2608966 0.0000000
Ultra_Edition 1 300.6070 300.60700 10.5642002 0.0011809
Phone_24 1 10714.6209 10714.62092 376.5427976 0.0000000
Positive_News 1 9671.6523 9671.65230 339.8898608 0.0000000
Negative_News 1 7108.2051 7108.20513 249.8029063 0.0000000
Competition 1 108.5536 108.55356 3.8148865 0.0510013
Year:Month 42 13191.9256 314.09347 11.0381537 0.0000000
Year:Month:Day 354 8296.6084 23.43675 0.8236351 0.9874761
Residuals 1377 39182.8846 28.45525 NA NA
linear_reg_fit_final <- step(linear_reg_fit) 
## Start:  AIC=6374.49
## Sales ~ Ad_Spend + (Year * Month)/Day + COVID_Lockdown + Ultra_Edition + 
##     Phone_24 + Positive_News + Negative_News + Competition
## 
## 
## Step:  AIC=6374.49
## Sales ~ Ad_Spend + Year + Month + COVID_Lockdown + Ultra_Edition + 
##     Positive_News + Negative_News + Competition + Year:Month + 
##     Year:Month:Day
## 
##                   Df Sum of Sq    RSS    AIC
## - Year:Month:Day 354      8297  47479 6011.4
## - Ultra_Edition    1        21  39204 6373.5
## - Competition      1        39  39222 6374.3
## <none>                          39183 6374.5
## - Negative_News    1      6008  45191 6628.7
## - Positive_News    1      7675  46858 6693.7
## - COVID_Lockdown   1     15661  54844 6976.4
## - Ad_Spend         1    114379 153562 8825.6
## 
## Step:  AIC=6011.43
## Sales ~ Ad_Spend + Year + Month + COVID_Lockdown + Ultra_Edition + 
##     Positive_News + Negative_News + Competition + Year:Month
## 
##                  Df Sum of Sq    RSS    AIC
## - Ultra_Edition   1        23  47502 6010.3
## - Competition     1        30  47510 6010.6
## <none>                         47479 6011.4
## - Negative_News   1      8015  55494 6289.6
## - Positive_News   1      9093  56572 6324.1
## - COVID_Lockdown  1     15965  63444 6530.0
## - Year:Month     43     22732  70212 6628.0
## - Ad_Spend        1    135383 182862 8431.2
## 
## Step:  AIC=6010.29
## Sales ~ Ad_Spend + Year + Month + COVID_Lockdown + Positive_News + 
##     Negative_News + Competition + Year:Month
## 
##                  Df Sum of Sq    RSS    AIC
## - Competition     1        30  47532 6009.4
## <none>                         47502 6010.3
## - Negative_News   1      8015  55517 6288.3
## - Positive_News   1      9085  56588 6322.6
## - COVID_Lockdown  1     15965  63467 6528.7
## - Year:Month     43     23191  70694 6638.3
## - Ad_Spend        1    135379 182881 8429.4
## 
## Step:  AIC=6009.43
## Sales ~ Ad_Spend + Year + Month + COVID_Lockdown + Positive_News + 
##     Negative_News + Year:Month
## 
##                  Df Sum of Sq    RSS    AIC
## <none>                         47532 6009.4
## - Negative_News   1      8015  55547 6287.3
## - Positive_News   1      9081  56614 6321.4
## - COVID_Lockdown  1     15964  63497 6527.5
## - Year:Month     43     23830  71363 6653.3
## - Ad_Spend        1    135661 183193 8430.5
# Choose a model by AIC in a Stepwise Algorithm

linear_reg_fit_final$anova
##               Step  Df     Deviance Resid. Df Resid. Dev      AIC
## 1                   NA           NA      1377   39182.88 6374.490
## 2       - Phone_24   0 4.365575e-11      1377   39182.88 6374.490
## 3 - Year:Month:Day 354 8.296608e+03      1731   47479.49 6011.426
## 4  - Ultra_Edition   1 2.278925e+01      1732   47502.28 6010.288
## 5    - Competition   1 3.021734e+01      1733   47532.50 6009.430
anova(linear_reg_fit_final)
## Analysis of Variance Table
## 
## Response: Sales
##                  Df Sum Sq Mean Sq  F value    Pr(>F)    
## Ad_Spend          1 151427  151427 5520.900 < 2.2e-16 ***
## Year              4  94608   23652  862.331 < 2.2e-16 ***
## Month            11 517817   47074 1716.293 < 2.2e-16 ***
## COVID_Lockdown    1  45137   45137 1645.678 < 2.2e-16 ***
## Positive_News     1   9692    9692  353.375 < 2.2e-16 ***
## Negative_News     1   7520    7520  274.174 < 2.2e-16 ***
## Year:Month       43  23830     554   20.205 < 2.2e-16 ***
## Residuals      1733  47532      27                       
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(linear_reg_fit_final)
## 
## Call:
## lm(formula = Sales ~ Ad_Spend + Year + Month + COVID_Lockdown + 
##     Positive_News + Negative_News + Year:Month, data = market_sale_final)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -13.6621  -3.6102  -0.2807   3.3290  27.3765 
## 
## Coefficients: (1 not defined because of singularities)
##                           Estimate Std. Error t value Pr(>|t|)    
## (Intercept)              6.565e+01  9.609e-01  68.324  < 2e-16 ***
## Ad_Spend                 1.416e-03  2.014e-05  70.328  < 2e-16 ***
## Year2017                 1.754e+00  1.331e+00   1.318 0.187765    
## Year2018                -1.706e+01  1.331e+00 -12.817  < 2e-16 ***
## Year2019                -2.031e+01  1.331e+00 -15.255  < 2e-16 ***
## Year2020                -2.133e+01  1.331e+00 -16.025  < 2e-16 ***
## MonthFebruary           -1.076e+00  1.355e+00  -0.794 0.427270    
## MonthMarch               2.185e-01  1.331e+00   0.164 0.869649    
## MonthApril              -1.851e+00  1.342e+00  -1.379 0.168043    
## MonthMay                -2.427e+00  1.331e+00  -1.824 0.068390 .  
## MonthJune                1.771e+00  1.342e+00   1.320 0.186943    
## MonthJuly                9.950e-01  1.330e+00   0.748 0.454593    
## MonthAugust              9.419e+00  1.332e+00   7.069 2.26e-12 ***
## MonthSeptember           1.897e+01  1.341e+00  14.143  < 2e-16 ***
## MonthOctober             2.745e+01  1.332e+00  20.605  < 2e-16 ***
## MonthNovember            4.178e+01  1.342e+00  31.126  < 2e-16 ***
## MonthDecember            4.905e+01  1.331e+00  36.860  < 2e-16 ***
## COVID_Lockdown1         -3.483e+01  1.444e+00 -24.125  < 2e-16 ***
## Positive_News1           1.290e+01  7.090e-01  18.196  < 2e-16 ***
## Negative_News1          -2.277e+01  1.332e+00 -17.094  < 2e-16 ***
## Year2017:MonthFebruary  -1.025e+01  1.926e+00  -5.323 1.16e-07 ***
## Year2018:MonthFebruary   7.149e+00  1.924e+00   3.715 0.000210 ***
## Year2019:MonthFebruary   3.171e+00  1.924e+00   1.649 0.099408 .  
## Year2020:MonthFebruary   5.005e+00  1.914e+00   2.615 0.008990 ** 
## Year2017:MonthMarch     -3.546e+00  1.883e+00  -1.883 0.059855 .  
## Year2018:MonthMarch      7.306e-01  1.883e+00   0.388 0.698001    
## Year2019:MonthMarch      2.308e+00  1.882e+00   1.226 0.220208    
## Year2020:MonthMarch      5.374e+00  1.910e+00   2.813 0.004961 ** 
## Year2017:MonthApril     -6.831e+00  1.898e+00  -3.599 0.000329 ***
## Year2018:MonthApril      4.347e+00  1.898e+00   2.291 0.022106 *  
## Year2019:MonthApril      8.629e+00  1.898e+00   4.546 5.86e-06 ***
## Year2020:MonthApril      8.313e+00  2.385e+00   3.486 0.000502 ***
## Year2017:MonthMay        1.501e+00  1.882e+00   0.797 0.425282    
## Year2018:MonthMay        6.937e+00  1.882e+00   3.686 0.000235 ***
## Year2019:MonthMay        1.105e+01  1.882e+00   5.870 5.22e-09 ***
## Year2020:MonthMay        1.115e+01  2.007e+00   5.555 3.21e-08 ***
## Year2017:MonthJune      -4.719e+00  1.898e+00  -2.487 0.012975 *  
## Year2018:MonthJune       3.308e+00  1.897e+00   1.743 0.081469 .  
## Year2019:MonthJune       1.280e+01  1.900e+00   6.740 2.14e-11 ***
## Year2020:MonthJune       8.612e+00  1.898e+00   4.538 6.06e-06 ***
## Year2017:MonthJuly      -1.327e+01  1.882e+00  -7.050 2.57e-12 ***
## Year2018:MonthJuly       8.134e+00  1.881e+00   4.324 1.62e-05 ***
## Year2019:MonthJuly       1.871e+01  1.881e+00   9.947  < 2e-16 ***
## Year2020:MonthJuly       8.028e+00  1.882e+00   4.265 2.11e-05 ***
## Year2017:MonthAugust    -9.244e+00  1.885e+00  -4.904 1.02e-06 ***
## Year2018:MonthAugust     5.231e+00  1.884e+00   2.776 0.005563 ** 
## Year2019:MonthAugust     1.030e+01  1.883e+00   5.472 5.09e-08 ***
## Year2020:MonthAugust     1.518e+01  1.885e+00   8.052 1.50e-15 ***
## Year2017:MonthSeptember -1.093e+01  1.897e+00  -5.760 9.95e-09 ***
## Year2018:MonthSeptember  1.773e+00  1.897e+00   0.935 0.350071    
## Year2019:MonthSeptember  1.298e+01  1.898e+00   6.839 1.10e-11 ***
## Year2020:MonthSeptember  1.043e+01  1.897e+00   5.496 4.46e-08 ***
## Year2017:MonthOctober   -1.429e+01  1.884e+00  -7.584 5.43e-14 ***
## Year2018:MonthOctober    2.755e+00  1.883e+00   1.463 0.143639    
## Year2019:MonthOctober    5.206e+00  1.885e+00   2.762 0.005803 ** 
## Year2020:MonthOctober    8.350e+00  1.882e+00   4.437 9.69e-06 ***
## Year2017:MonthNovember  -1.919e+01  1.897e+00 -10.118  < 2e-16 ***
## Year2018:MonthNovember   7.326e+00  1.898e+00   3.860 0.000117 ***
## Year2019:MonthNovember   1.230e+01  1.898e+00   6.482 1.18e-10 ***
## Year2020:MonthNovember   4.986e+00  1.899e+00   2.626 0.008728 ** 
## Year2017:MonthDecember  -1.355e+01  1.882e+00  -7.202 8.83e-13 ***
## Year2018:MonthDecember   6.887e+00  1.884e+00   3.656 0.000264 ***
## Year2019:MonthDecember   9.716e+00  1.882e+00   5.161 2.73e-07 ***
## Year2020:MonthDecember          NA         NA      NA       NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.237 on 1733 degrees of freedom
## Multiple R-squared:  0.947,  Adjusted R-squared:  0.9451 
## F-statistic: 499.9 on 62 and 1733 DF,  p-value: < 2.2e-16

Model diagnostics

hist(linear_reg_fit_final$residuals)

plot(fitted(linear_reg_fit_final), resid(linear_reg_fit_final) )

# calculate the mean squared error
mse_linear <- 
  sum(market_sale_final$Sales - 
  predict.glm(linear_reg_fit_final,
           newdata = market_sale_final, 
           type = "response") ) ^2
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type = if (type == :
## prediction from a rank-deficient fit may be misleading

Modelling using Poisson regression

Fitting the model

poisson_reg_fit <-
  glm(
    Sales ~ Ad_Spend + (Year * Month) / Day +
      COVID_Lockdown + Ultra_Edition + Phone_24 + 
      Positive_News + Negative_News + Competition,
    data = market_sale_final,
    family = poisson()
)

car::Anova(poisson_reg_fit)
## Analysis of Deviance Table (Type II tests)
## 
## Response: Sales
##                LR Chisq  Df Pr(>Chisq)    
## Ad_Spend         1364.9   1  < 2.2e-16 ***
## Year               15.4   4   0.003935 ** 
## Month            3501.8  11  < 2.2e-16 ***
## COVID_Lockdown    356.0   1  < 2.2e-16 ***
## Ultra_Edition       0.4   1   0.506484    
## Phone_24                  0               
## Positive_News      88.1   1  < 2.2e-16 ***
## Negative_News      80.1   1  < 2.2e-16 ***
## Competition         0.3   1   0.594721    
## Year:Month        216.2  42  < 2.2e-16 ***
## Year:Month:Day    131.5 354   1.000000    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
poisson_reg_fit_final <- step(poisson_reg_fit)
## Start:  AIC=12543.89
## Sales ~ Ad_Spend + (Year * Month)/Day + COVID_Lockdown + Ultra_Edition + 
##     Phone_24 + Positive_News + Negative_News + Competition
## 
## 
## Step:  AIC=12543.89
## Sales ~ Ad_Spend + Year + Month + COVID_Lockdown + Ultra_Edition + 
##     Positive_News + Negative_News + Competition + Year:Month + 
##     Year:Month:Day
## 
##                   Df Deviance   AIC
## - Year:Month:Day 354   772.78 11967
## - Competition      1   641.53 12542
## - Ultra_Edition    1   641.69 12542
## <none>                 641.25 12544
## - Negative_News    1   721.35 12622
## - Positive_News    1   729.31 12630
## - COVID_Lockdown   1   997.21 12898
## - Ad_Spend         1  2006.15 13907
## 
## Step:  AIC=11967.41
## Sales ~ Ad_Spend + Year + Month + COVID_Lockdown + Ultra_Edition + 
##     Positive_News + Negative_News + Competition + Year:Month
## 
##                  Df Deviance   AIC
## - Competition     1   772.98 11966
## - Ultra_Edition   1   773.15 11966
## <none>                772.78 11967
## - Negative_News   1   875.60 12068
## - Positive_News   1   877.81 12070
## - Year:Month     43  1106.23 12215
## - COVID_Lockdown  1  1130.86 12324
## - Ad_Spend        1  2394.15 13587
## 
## Step:  AIC=11965.61
## Sales ~ Ad_Spend + Year + Month + COVID_Lockdown + Ultra_Edition + 
##     Positive_News + Negative_News + Year:Month
## 
##                  Df Deviance   AIC
## - Ultra_Edition   1   773.35 11964
## <none>                772.98 11966
## - Negative_News   1   875.80 12066
## - Positive_News   1   877.97 12069
## - Year:Month     43  1122.66 12229
## - COVID_Lockdown  1  1131.06 12322
## - Ad_Spend        1  2397.68 13588
## 
## Step:  AIC=11963.99
## Sales ~ Ad_Spend + Year + Month + COVID_Lockdown + Positive_News + 
##     Negative_News + Year:Month
## 
##                  Df Deviance   AIC
## <none>                773.35 11964
## - Negative_News   1   876.17 12065
## - Positive_News   1   878.24 12067
## - Year:Month     43  1135.63 12240
## - COVID_Lockdown  1  1131.43 12320
## - Ad_Spend        1  2397.95 13587
car::Anova(poisson_reg_fit_final)
## Analysis of Deviance Table (Type II tests)
## 
## Response: Sales
##                LR Chisq Df Pr(>Chisq)    
## Ad_Spend         1624.6  1  < 2.2e-16 ***
## Year              537.0  4  < 2.2e-16 ***
## Month            5536.1 11  < 2.2e-16 ***
## COVID_Lockdown    358.1  1  < 2.2e-16 ***
## Positive_News     104.9  1  < 2.2e-16 ***
## Negative_News     102.8  1  < 2.2e-16 ***
## Year:Month        362.3 43  < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Model diagnostics

hist(poisson_reg_fit$residuals)

plot(fitted(poisson_reg_fit), resid(poisson_reg_fit) )

AIC(poisson_reg_fit_final)
## [1] 11963.99
# calculate the mean squared error
mse_poisson <- 
  sum(market_sale_final$Sales - 
  predict.glm(poisson_reg_fit_final,
           newdata = market_sale_final, 
           type = "response") ) ^2
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type = if (type == :
## prediction from a rank-deficient fit may be misleading

Model comparison

Final results

linear_reg_predict <- 
  predict.lm(linear_reg_fit_final,
           newdata = dec_ad_final, 
           type = "response",
           se.fit = TRUE) 
## Warning in predict.lm(linear_reg_fit_final, newdata = dec_ad_final, type =
## "response", : prediction from a rank-deficient fit may be misleading
dec_ad_final$Sales <- round(linear_reg_predict$fit)


dec_ad_final %>% 
  select(Date, Sales) %>% 
  knitr::kable()
Date Sales
2020-12-01 108
2020-12-02 105
2020-12-03 100
2020-12-04 105
2020-12-05 124
2020-12-06 94
2020-12-07 111
2020-12-08 106
2020-12-09 116
2020-12-10 95
2020-12-11 99
2020-12-12 112
2020-12-13 123
2020-12-14 102
2020-12-15 100
2020-12-16 109
2020-12-17 105
2020-12-18 98
2020-12-19 106
2020-12-20 121
2020-12-21 98
2020-12-22 126
2020-12-23 103
2020-12-24 116
2020-12-25 94
2020-12-26 113
2020-12-27 95
2020-12-28 94
2020-12-29 135
2020-12-30 111
2020-12-31 107

The linear regression model predicts the total unit sales in December 2020 is 3331 with 95% confidence intervals of 3231 and 3430. Thus, the model suggests that the target sale of 3900 units in December 2020 is unlikely to meet.

market_sale_final$Type <- "Historical"
dec_ad_final$Type <- "Forecasted"


dec_ad_final$Sales_max <- with(linear_reg_predict, fit + 1.96*se.fit)
dec_ad_final$Sales_min <- with(linear_reg_predict, fit - 1.96*se.fit)

market_sale_final$Sales_max <- 
  market_sale_final$Sales_min <- 
  market_sale_final$Sales

g <- 
  market_sale_final %>% 
  bind_rows(dec_ad_final) %>% 
  mutate(Type = factor(Type)) %>% 
  ggplot(aes(x = Date, y = Sales, col = Type, group = Type)) +
  geom_path() +
  geom_ribbon(aes(ymin = Sales_min, ymax = Sales_max, fill = Type), alpha = 0.2) +
  scale_x_date(date_breaks = "6 month", expand = c(0.01, 0.01), 
                 date_labels = "%b %Y") + 
  theme_light()


plotly::ggplotly(g)

Notes

In conclusion, multiple linear regression appears to perform the best here

The model explains 94.51% of the variation in the total unit sold and hsould therefore be an accurate model for prediction.

An alternative method is using a time-series-based analysis with the forecast R package, which can better handle autocorrelation, seasonality, and other temporal patterns in the data. However, applying the method within the forecast R package requires the dataset to be converted to a Time-Series (ts) object. This process can be complicated, especially if we need to include other predictors such as “Total Advertising Spend” in the model for inference and prediction.